perf(napi/parser): optimize string deserialization for non-ASCII sources (#20834)
Conversation
Thanks very much! And thanks for all the info - appreciate the detail. I've been aware that string decoding is the largest bottleneck in raw transfer deserialization, but just not had time to get into it. Very much appreciate you digging in. A couple of questions:
I ask because I'm surprised at the conclusion that 9 is the optimal "switch over point". As the TODO comment noted, 50 was a guess, but it was based on some rough benchmarking. What I found was optimal previously was 50 +/- 20, so 9 is well outside that range.

I can see a couple more changes we could make which will likely boost perf further, but they'll require more intrusive changes, so we can leave them to follow-on PRs.

Just to say, this is a really good example of using AI fruitfully and intelligently - in stark contrast to the reams of slop PRs we and many other open source projects are being inundated with. Thank you!
Oh wait, I know you! Hello Joshua!
Also... do you have any more info on experiment 15? Sounds like 2 different things were tested at the same time. I'd be interested in teasing those 2 apart. I'd have expected
Hey @overlookmotel! Haha, yeah, I didn't want to open the PR with "we know each other" but yeah, we do! The above experiments were on an x86 Intel machine running Arch Linux (I use Arch btw). I also have access to a MacBook with an M4; happy to rerun on that today to see if the value holds across architectures, and report back.
Ok, so I ran the benchmarks on an M4 mac locally. Used a standalone script against the same fixtures and deserialize path as
I also addressed your thoughts re: threshold size on the M4 by running everything at a few different sizes on this architecture, got this:
On M4, 8-15 are all within noise of each other (but significantly faster than threshold 50). The threshold choice doesn't seem to matter much on Apple Silicon in the 8-15 range. The big wins in this PR are from the
Two changes to `deserializeStr` in the raw transfer JS deserializer:

1. Compute `firstNonAsciiPos` at deserialization start to enable the fast `sourceText.substr()` path for strings in the all-ASCII prefix of non-ASCII sources. Previously, a single non-ASCII character anywhere in the source disabled `substr` for all ~148K strings.
2. Lower the `TextDecoder` threshold from 50 to 9 bytes. For strings of 10+ bytes, `TextDecoder.decode()` is faster than the byte-by-byte `fromCodePoint` + string concatenation loop, which caused V8's `StringAdd_CheckNone` builtin to dominate the profile at 13.7%.

Benchmarked across 5 real-world fixtures (checker.ts 2.9MB, cal.com.tsx 1MB, RadixUIAdoptionSection.jsx 3KB, pdf.mjs 554KB, antd.js 3.9MB):

Before: 97.7ms total
After: 76.4ms total (-22%)
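For illustration, here's a minimal sketch of how the two changes fit together. The helper name `findFirstNonAsciiPos` and the overall shape are invented for this sketch; the real generated `deserializeStr` differs in detail (e.g. its short-string loop decodes multi-byte codepoints inline rather than falling back to `TextDecoder`).

```javascript
const textDecoder = new TextDecoder();

// Change 1: scan once at deserialization start for the first non-ASCII byte.
// For an all-ASCII source this returns the length, so every string qualifies.
function findFirstNonAsciiPos(uint8) {
  for (let i = 0; i < uint8.length; i++) {
    if (uint8[i] >= 128) return i;
  }
  return uint8.length;
}

// Change 2: threshold lowered from 50 to 9; `TextDecoder.decode` wins for
// strings of 10+ bytes.
const TEXT_DECODER_THRESHOLD = 9;

function deserializeStr(uint8, sourceText, firstNonAsciiPos, pos, len) {
  // Fast path: string lies wholly in the all-ASCII prefix, where UTF-8 byte
  // offsets equal UTF-16 char offsets, so slice `sourceText` directly.
  if (pos + len <= firstNonAsciiPos) return sourceText.substr(pos, len);
  // Longer strings: the native decode is cheaper than the JS loop.
  if (len > TEXT_DECODER_THRESHOLD) {
    return textDecoder.decode(uint8.subarray(pos, pos + len));
  }
  // Short strings: byte-by-byte loop still beats the native call overhead.
  let out = '';
  for (let i = pos; i < pos + len; i++) {
    const c = uint8[i];
    if (c >= 128) {
      // Non-ASCII tail: hand the rest to TextDecoder (sketch simplification).
      return out + textDecoder.decode(uint8.subarray(i, pos + len));
    }
    out += String.fromCodePoint(c);
  }
  return out;
}
```

The point of the sketch is the ordering: the prefix check comes first and is a single comparison, so the common case on mostly-ASCII sources never touches the decoder at all.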
overlookmotel
left a comment
Great! I rebased on latest main and will merge assuming CI passes.
I've tested on a Mac Mini M4 Pro and the results hold up. I've put together a little benchmark repo which measures just

I'm going to try a few things and see if I can speed it up further. If you'd like to, please set your robot on it too!
…rializeStr` (#21019)

Follow-on after #20834. Simplify the branch condition in `deserializeStr` for determining whether it can take the fast path of just slicing `sourceText`. There's no need to check `sourceIsAscii`; just compare the offset to `firstNonAsciiPos` (the position in buffer of the first non-ASCII byte in source code). When source is 100% ASCII, `firstNonAsciiPos = sourceEndPos`, so `pos < firstNonAsciiPos` passes for all positions in source.

The implementation is different for the parser and for Oxlint, as the source text sits in a different location in the buffer (at the start in the parser, at the end in Oxlint), but the principle is the same in both.

[Benchmarking](https://github.com/overlookmotel/oxc-raw-str-bench) showed this speeds up `deserializeStr` by a small percentage.
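A hedged sketch of the simplification (function names invented; shown for the parser layout, where source text sits at the start of the buffer):

```javascript
// Before: two checks — a boolean flag plus an offset comparison.
function canSliceSourceText_before(sourceIsAscii, pos, firstNonAsciiPos) {
  return sourceIsAscii || pos < firstNonAsciiPos;
}

// After: when source is 100% ASCII, set `firstNonAsciiPos = sourceEndPos`,
// so a single comparison covers both cases and `sourceIsAscii` goes away.
function canSliceSourceText_after(pos, firstNonAsciiPos) {
  return pos < firstNonAsciiPos;
}
```

Every string's start position lies before `sourceEndPos`, so for an all-ASCII source the single comparison admits exactly the same positions as the old flag did.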
…ransfer (#21021)

Improve perf of deserializing strings in raw transfer. This PR combines several optimizations, which have been tested and benchmarked in https://github.com/overlookmotel/oxc-raw-str-bench. It implements the version "latin-slice-onebyte64" from that repo, which is the current winner.

String deserialization is the main bottleneck in raw transfer, so speeding it up will likely make a large impact on deserialization overall. This work follows on from #20834, which produced a major speed-up in many files by making files which contain some non-ASCII characters take the fast path of slicing `sourceText` more often. This PR tackles the remainder: speeding up the fallback path where the fast path can't be taken.

## Optimizations

The optimizations in this PR are:

### Latin1

When source is not 100% ASCII, decode the source text from the buffer as Latin1. A Latin1-decoded string represents each UTF-8 byte as a single Latin1 character, so it can be indexed into using UTF-8 offsets.

So when we can't slice the string from `sourceText` because the UTF-8 and UTF-16 offsets differ (after any non-ASCII character), loop through the string's bytes and check if they're all ASCII. If they are, the string can be sliced from `sourceTextLatin` instead, with the original UTF-8 offsets. This is far faster than calling `textDecoder.decode`, as it avoids a call into C++.

[Benchmarks show](https://github.com/overlookmotel/oxc-raw-str-bench/blob/4f96275efa9a35d5d27615abb27f21a137149cc0/README.md#apply30-vs-latin-vs-latin-source64) a speed-up of 55% on average, and up to 70% on some files.

### Latin1 decoding method

It turns out that `new TextDecoder("latin1").decode(arr)` doesn't actually decode to Latin1! Per the WHATWG Encoding Standard, "latin1" is mapped to "windows-1252". The result is that with `TextDecoder("latin1")`:

1. `decode` is quite complicated, requiring a first pass over the bytes to determine if they're all ASCII, followed by a 2nd pass to do the actual `windows-1252` decoding. If the string *does* contain any non-ASCII characters (which it always does in our use case), NodeJS implements the decoding in JS, not native code. Slow.
2. `decode` produces a 2-byte-per-char string (`TWO_BYTE` in V8), which takes more memory, and is slower for all operations on it, e.g. string comparison, hashing for use as an object key, etc.

Instead, use `Buffer.prototype.latin1Slice`, which:

1. Does a pure Latin1 decode, which is just a single `memcpy` call.
2. Produces a 1-byte-per-char string (`ONE_BYTE` in V8).

`latin1Slice` involves a call into C++, but we only do it once per file, so this cost is tiny in the context of deserializing the whole AST.

### Latin1 string slicing

In the fast path, slice from the Latin1-decoded string instead of `sourceText`. In the fast path, we know that all bytes of source comprising the string are ASCII, so no further checks are required.

This makes no difference on benchmarks for `deserializeStr` itself, but it may have beneficial effects downstream for code (e.g. lint rules) which accesses strings in the AST, e.g. `Identifier` names.

Because Latin1-decoded source text is `ONE_BYTE`-encoded, slices of it are too. In comparison, slices of `sourceText` may be `ONE_BYTE` or `TWO_BYTE`: if a file's source is pure ASCII, it'll be `ONE_BYTE`; if the source contains any non-ASCII characters, it'll be `TWO_BYTE`. Files in a repo will likely be a mix of both, which makes strings returned from `deserializeStr` and placed in the AST a mix too. This in turn makes functions (e.g. lint rule visitors) polymorphic. V8 cannot optimize them as aggressively as if they saw only `ONE_BYTE` strings.

We cannot ensure that all strings returned by `deserializeStr` are `ONE_BYTE`. Some strings may contain non-ASCII characters, and they *have* to be represented in `TWO_BYTE` form. But we can minimize it: now only strings which *themselves* contain non-ASCII characters are `TWO_BYTE`, whereas before they would be if the source text as a whole contained a single non-ASCII byte. Code which accesses `Identifier` names, for example, will exclusively see `ONE_BYTE` strings and will be more heavily optimized, because Unicode `Identifier`s are rarer than hen's teeth in real-world code.

### Remove string-concatenation loop

Previously, strings which are outside of source text were assembled byte-by-byte in a loop via concatenation. Instead, check that all the bytes are ASCII first, copy them into an array, and pass that array to `String.fromCharCode` with `fromCharCode.apply(null, array)`. To avoid allocating a fresh array every time, hold a stock of arrays for all string lengths that this path can require, and reuse them.

This is a variation on the approach that #20883 took, but without the massive switch. This produces much tighter assembly, and avoids regressing the fast path due to making `deserializeStr` a very large function. Despite the complexity, and multiple operations, [this is up to 3x faster](https://github.com/overlookmotel/oxc-raw-str-bench/blob/4f96275efa9a35d5d27615abb27f21a137149cc0/README.md#apply30-vs-switch30) than the switch approach, and gives an average 30% speed-up.

### Increase native call threshold

The above optimizations make the slow path much faster. This shifts the tipping point at which it's faster to make a native call to `TextDecoder.decode` from 9 bytes to 64 bytes. Most strings now avoid the native call and stay in JS code which is heavily optimized by Turbofan.

The tipping point of 64 is something of a guesstimate. Benchmarking shows it's in the right ballpark, but we could finesse it, and probably squeeze out another couple of %.

## Credit

The Latin1 string technique was cooked up by @joshuaisaact in overlookmotel/oxc-raw-str-bench#1. All credit to him for this masterstroke which cracks the whole problem!
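A minimal sketch of these ideas combined, under assumptions: the function names are invented, the real generated deserializer is structured differently, and `Buffer.prototype.latin1Slice` is an undocumented (but long-standing) Node.js internal that `buf.toString('latin1')` uses under the hood.

```javascript
const textDecoder = new TextDecoder();
const NATIVE_CALL_THRESHOLD = 64; // raised from 9: the JS slow path is now faster

// Once per file: decode the whole source as Latin1. One char per UTF-8 byte,
// so UTF-8 offsets index it directly, and V8 stores it as ONE_BYTE.
function makeLatin1Source(buffer) {
  return buffer.latin1Slice(0, buffer.length);
}

// Pool of reusable arrays, one per length the `fromCharCode.apply` path can
// need, so no array is allocated per string.
const scratch = [];
for (let len = 0; len <= NATIVE_CALL_THRESHOLD; len++) scratch.push(new Array(len));

// For ASCII strings held outside source text: build via `fromCharCode.apply`
// on a pooled array instead of a byte-by-byte concatenation loop.
function fromAsciiBytes(buffer, pos, len) {
  const arr = scratch[len];
  for (let i = 0; i < len; i++) arr[i] = buffer[pos + i];
  return String.fromCharCode.apply(null, arr);
}

// Fallback path when the UTF-16 slice of `sourceText` isn't available.
function deserializeStrSlow(buffer, sourceTextLatin1, pos, len) {
  let allAscii = true;
  for (let i = pos; i < pos + len; i++) {
    if (buffer[i] >= 128) { allAscii = false; break; }
  }
  // All-ASCII: slice the ONE_BYTE Latin1 string with UTF-8 offsets directly.
  if (allAscii) return sourceTextLatin1.slice(pos, pos + len);
  // Contains non-ASCII: result must be TWO_BYTE anyway; decode natively.
  return textDecoder.decode(buffer.subarray(pos, pos + len));
}
```

The design point the sketch captures: the all-ASCII check is cheap and JS-only, so the expensive C++ crossing (`latin1Slice`) happens once per file and `textDecoder.decode` only for strings that genuinely contain non-ASCII bytes.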
### 🐛 Bug Fixes

- fc7f60c allocator: Revert changes to `get_current_chunk_footer_field_offset` (#20964) (overlookmotel)
- 31316c8 semantic: Rebind class expressions before identifier checks (#20916) (camc314)

### ⚡ Performance

- fb52383 napi/parser, linter/plugins: Clear buffers and source texts earlier (#21025) (overlookmotel)
- 3b7dec4 napi/parser, linter/plugins: Use `utf8Slice` for decoding UTF-8 strings (#21022) (overlookmotel)
- 012c924 napi/parser, linter/plugins: Speed up decoding strings in raw transfer (#21021) (overlookmotel)
- 55e1e9b napi/parser, linter/plugins: Initialize vars as 0 (#21020) (overlookmotel)
- c25ef02 napi/parser, linter/plugins: Simplify branch condition in `deserializeStr` (#21019) (overlookmotel)
- 9f494c3 napi/parser, linter/plugins: Raw transfer use `String.fromCharCode` in string decoding (#21018) (overlookmotel)
- 91cf105 allocator: Increase initial chunk size from 512B to 16KB (#20968) (overlookmotel)
- cbc0c21 allocator: Add `#[cold]` to error handling functions (#20967) (overlookmotel)
- 0503a78 napi/parser, linter/plugins: Faster deserialization of `raw` fields (#20923) (overlookmotel)
- a24f75e napi/parser: Optimize string deserialization for non-ASCII sources (#20834) (Joshua Tuddenham)

### 📚 Documentation

- c78a57a syntax: Fix typo (#21044) (camc314)
- f5e228f allocator: Fix typo in comment (#20972) (overlookmotel)
- 7159d51 allocator: Improve doc comment examples for `vec2::Vec` (#20969) (overlookmotel)
- b1da750 allocator, data_structures: Correct comments (#20966) (overlookmotel)

Co-authored-by: Boshen <1430279+Boshen@users.noreply.github.com>
# Oxlint

### 💥 BREAKING CHANGES

- 22ce6af oxlint/lsp: [**BREAKING**] Show/fix safe suggestions by default (#19816) (Sysix)

### 🚀 Features

- 7a7b7b8 oxlint/lsp: Add source.fixAllDangerous.oxc code action kind (#20526) (bab)
- 9cfe57e linter/unicorn: Implement prefer-import-meta-properties rule (#20662) (Irfan - ئىرفان)
- 1edb391 linter/eslint: Implement `no-restricted-exports` rule (#20592) (Nicolas Le Cam)
- 0f12bcd linter/react: Implement `hook-use-state` rule (#20986) (Khaled Labeb)
- 1513a9f oxlint/lsp: Show note field for lsp diagnostic (#20983) (Sysix)
- 7fdf722 linter/unicorn: Implement `no-useless-iterator-to-array` rule (#20945) (Mikhail Baev)
- 39c8f2c linter/jest: Implement padding-around-after-all-blocks (#21034) (Sapphire)
- ac39e51 linter/eslint-vitest-plugin: Prefer importing vitest globals (#20960) (Said Atrahouch)
- 0b84de1 oxlint: Support allow option for prefer-promise-reject-errors (#20934) (camc314)
- 23db851 linter/consistent-return: Move rule from nursery to suspicious (#20920) (camc314)
- 9a27e32 linter/no-unnecessary-type-conversion: Move rule from nursery to suspicious (#20919) (camc314)
- 1ca7b58 linter/dot-notation: Move rule from nursery to style (#20918) (camc314)
- 73ba81a linter/consistent-type-exports: Move rule from nursery to style (#20917) (camc314)
- b9199b1 linter/unicorn: Implement switch-case-break-position (#20872) (Mikhail Baev)
- 3435ff8 linter: Implement `prefer-snapshot-hint` rule in Jest and Vitest (#20870) (Said Atrahouch)
- 98510d2 linter: Implement react/prefer-function-component (#19652) (Connor Shea)
- 871f9d9 linter: Implement no-useless-assignment (#15466) (Zhaoting Zhou)
- 0f01fbd linter: Implement eslint/object-shorthand (#17688) (yue)

### 🐛 Bug Fixes

- dd2df87 npm: Export package.json for oxlint and oxfmt (#20784) (kazuya kawaguchi)
- 9bc77dd linter/no-unused-private-class-members: False positive with await expr (#21067) (camc314)
- 60a57cd linter/const-comparisons: Detect equality contradictions (#21065) (camc314)
- 2bb2be2 linter/no-array-index-key: False positive when index is passed as function argument (#21012) (bab)
- 6492953 linter/no-this-in-sfc: Only flag `this` used as member expression object (#20961) (bab)
- 9446dcc oxlint/lsp: Skip `node_modules` in oxlint config walker (#21004) (copilot-swe-agent)
- af89923 linter/no-namespace: Support glob pattern matching against basename (#21031) (bab)
- 64a1a7e oxlint: Don't search for nested config outside base config (#21051) (Sysix)
- 3b953bc linter/button-has-type: Ignore `document.createElement` calls (#21008) (Said Atrahouch)
- 8c36070 linter/unicorn: Add support for `Array.from()` for `prefer-set-size` rule (#21016) (Mikhail Baev)
- c1a48f0 linter: Detect vitest import from vite-plus/test (#20976) (Said Atrahouch)
- 5c32fd1 lsp: Prevent corrupted autofix output from overlapping text edits (#19793) (Peter Wagenet)
- ca79960 linter/no-array-index-key: Move span to `key` property (#20947) (camc314)
- 2098274 linter: Add suggestion for `jest/prefer-equality-matcher` (#20925) (eryue0220)
- 6eb77ec linter: Allow default-import barrels in import/named (#20757) (Bazyli Brzóska)
- 9c218ef linter/eslint-vitest-plugin: Remove pending fix status for require-local-test-context-for-concurrent-snapshot (#20890) (Said Atrahouch)

### ⚡ Performance

- fb52383 napi/parser, linter/plugins: Clear buffers and source texts earlier (#21025) (overlookmotel)
- 3b7dec4 napi/parser, linter/plugins: Use `utf8Slice` for decoding UTF-8 strings (#21022) (overlookmotel)
- 012c924 napi/parser, linter/plugins: Speed up decoding strings in raw transfer (#21021) (overlookmotel)
- 55e1e9b napi/parser, linter/plugins: Initialize vars as 0 (#21020) (overlookmotel)
- c25ef02 napi/parser, linter/plugins: Simplify branch condition in `deserializeStr` (#21019) (overlookmotel)
- 9f494c3 napi/parser, linter/plugins: Raw transfer use `String.fromCharCode` in string decoding (#21018) (overlookmotel)
- 0503a78 napi/parser, linter/plugins: Faster deserialization of `raw` fields (#20923) (overlookmotel)
- a24f75e napi/parser: Optimize string deserialization for non-ASCII sources (#20834) (Joshua Tuddenham)

### 📚 Documentation

- af72b80 oxlint: Fix typo for --tsconfig (#20889) (leaysgur)
- 70c53b1 linter: Highlight that tsconfig is not respected in type aware linting (#20884) (camc314)

# Oxfmt

### 🚀 Features

- 35cf6e8 oxfmt: Add node version hint for ts config import failures (#21046) (camc314)

### 🐛 Bug Fixes

- dd2df87 npm: Export package.json for oxlint and oxfmt (#20784) (kazuya kawaguchi)
- 9d45511 oxfmt: Propagate file write errors instead of panicking (#20997) (leaysgur)
- 139ddd9 formatter: Handle leading comment after array elision (#20987) (leaysgur)
- 4216380 oxfmt: Support `.editorconfig` `tab_width` fallback (#20988) (leaysgur)
- d10df39 formatter: Resolve pending space in fits measurer before expanded-mode early exit (#20954) (Dunqing)
- f9ef1bd formatter: Avoid breaking after `=>` when arrow body has JSDoc type cast (#20857) (bab)

Co-authored-by: Boshen <1430279+Boshen@users.noreply.github.com>
AI Disclosure: Developed with Claude Code (Opus). The winning approach came out of an automated experiment loop (auto-claude). I was looking for a tight feedback loop to test the tool on, and the `TODO: Find best switch-over point` comment in `deserializeStr` caught my eye. 20 experiments, keep-or-revert on each (all 20 summarized in an expandable section at the bottom). All code reviewed and understood. Happy to close this if it's not useful or doesn't meet the bar.

## Why
I was profiling the raw transfer deserialization path (`node --prof` on `checker.ts`) and noticed `StringAdd_CheckNone` at 13.7% of time — the single hottest function. It comes from the byte-by-byte `out += fromCodePoint(c)` loop in `deserializeStr` when `sourceIsAscii` is false.

The thing is, `sourceIsAscii` is false for almost everything. All 5 NAPI bench fixtures are non-ASCII. `checker.ts` has literally one Bengali character at position 2.1M out of 2.9M. That one character disables the fast `substr` path for all ~148K strings.

## What
Two changes to the generator (`tasks/ast_tools/src/generators/raw_transfer.rs`) — that's the only file with real changes. The 9 generated JS files in the diff are the mechanical output of `cargo run -p oxc_ast_tools`.

1. **`firstNonAsciiPos` scan at init** — On non-ASCII sources, find the first non-ASCII byte once upfront. Strings ending before that position can still use `sourceText.substr()`, since byte offsets equal char offsets in the ASCII prefix. For `checker.ts` this covers 73% of the file; for `pdf.mjs`, 98%.
2. **Lower TextDecoder threshold from 50 to 9** — The existing TODO asked for the right switch-over point. Experimentally, 9 is the sweet spot: `TextDecoder` beats the `fromCodePoint` concat loop for strings of 10+ bytes, and the concat loop is still faster for very short strings where the native call overhead dominates.

Benchmarked on the
`complicated()` test set (5 rounds of 30 iters, dropping round 1 for JIT warmup):

Also verified across 15 files — the 10 non-ASCII files above plus 5 ASCII files (`react.development.js`, `binder.ts`, `moment.js`, `jquery.js`, `vue.js`). ASCII files are unchanged (our code only touches the non-ASCII path). -16% across non-ASCII files, no regressions.

## References

`TODO: Find best switch-over point` in `STR_DESERIALIZER_BODY`

I appreciate this is already a bloated PR description (sorry @overlookmotel), but given the fairly unusual approach, I thought you might want to see a very short summary of all 20 experiments Claude clauded through:
The loop works like this: edit code, benchmark, keep if faster, `git reset --hard` if not. Metric is total deserialization time across the benchmark corpus.

Baseline: 97.7ms (5-file corpus, all non-ASCII)
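The keep-or-revert decision can be sketched in shell; the benchmark script path below is hypothetical, and only the numeric comparison is real code:

```shell
# One iteration of the keep-or-revert loop, sketched. In reality an edit was
# applied, then the corpus benchmark re-run:
#
#   candidate=$(node bench/deserialize.mjs)   # hypothetical: prints total ms
#
# The keep/revert decision reduces to a float comparison:
faster_than() {  # faster_than CANDIDATE_MS BASELINE_MS -> exit 0 if faster
  awk -v a="$1" -v b="$2" 'BEGIN { exit !(a + 0 < b + 0) }'
}

if faster_than "76.4" "97.7"; then
  echo "keep"       # commit the experiment
else
  git reset --hard  # revert it
fi
```

`awk` is used for the comparison because the benchmark reports fractional milliseconds, which plain `[ -lt ]` can't compare.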
- `switch` on len for inline `String.fromCharCode(uint8[pos], ...)`
- `fromCharCode.apply` at end
- `fromCodePoint` with `String.fromCharCode` in the loop
- `fromCodePoint` better.
- `fromCharCode` approaches for short strings
- `Buffer.from().toString()` instead of TextDecoder
- `firstNonAsciiPos` only (use substr before it, TextDecoder after, no loop)
- `firstNonAsciiPos` + threshold 9 + keep the loop

Three things I (Claude) learned:
- `fromCodePoint` loop for short strings (1-9 bytes) made things worse. The loop is genuinely good for that range.
- `firstNonAsciiPos` only works when combined with the threshold change. On its own it hurts files where non-ASCII appears early (antd.js, cal.com.tsx).